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Abstract 

In this paper we consider the problem of estimating simultaneously low-rank and row-wise 
sparse matrices from nested linear measurements where the linear operator consists of the prod¬ 
uct of a linear operator W and a matrix 'P. Leveraging the nested structure of the measurement 
operator, we propose a computationally efficient two-stage algorithm for estimating the simul¬ 
taneously structured target matrix. Assuming that W is a restricted isometry for low-rank 
matrices and •P' is a restricted isometry for row-wise sparse matrices, we establish an accuracy 
guarantee that holds uniformly for all sufficiently low-rank and row-wise sparse matrices with 
high probability. Furthermore, using standard tools from information theory, we establish a 
minimax lower bound for estimation of simultaneously low-rank and row-wise sparse matrices 
from linear measurements that need not be nested. The accuracy bounds established for the 
algorithm, that also serve as a minimax upper bound, differ from the derived minimax lower 
bound merely by a polylogarithmic factor of the dimensions. Therefore, the proposed algorithm 
is nearly minimax optimal. We also discuss some applications of the proposed observation model 
and evaluate our algorithm through numerical simulation. 


1 Introduction 


In this paper we study the problem of estimating a matrix X* that is simultaneously low-rank and 
row-wise sparse from the noisy linear measurements 

y = A{X^) + z, (1) 

where A is a known linear operator and 2 models the additive error or noise. In particular, we 
consider this problem in the high-dimensional regime where the dimension of the measurements y 
can be much smaller than the ambient dimension of the target X*. Estimation of low-rank matrices 
from the linear measurements Q has been studied extensively in various settings (see, e.g., m, 
|36) . and |23]). However, in cases where the target matrix is characterized by two or more structures 
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simultaneously, the research regarding efficient and statistically optimal estimation is less developed. 
Ideally, different structures of the target matrix can be exploited to improve the sample complexity. 

In this paper, assuming that the measurement operator A has a nested structure as described below, 
we address the problem of estimating the simultaneously low-rank and row-wise sparse JC* from the 
linear measurements Q. Throughout the paper we may refer row-wise sparse matrices simply as 
sparse matrices unless explicitly stated otherwise. The nested operator A is the product of a linear 
operator and a matrix which we assume to be restricted (near) isometries for low-rank and row¬ 
wise sparse matrices, respectively. The inner matrix component in A compresses the columns of the 
target matrix without destroying its low-rank structure. This compression can be inverted because 
the target has sparse columns. The outer linear operator in A exploits the low-rank structure 
in the compressed matrix to compress the dimensions even more. We propose a computationally 
efficient method that relies on the nested structure of the measurement operator. We show that the 
proposed method is accurate uniformly for all simultaneously low-rank and row-wise sparse targets. 
Therefore, the obtained accuracy bound can be interpreted as an upper bound for the minimax 
rate as well. Furthermore, using standard information theoretic tools we establish a minimax lower 
bound that holds for any well-bounded linear operator A, nested or otherwise. As will be seen 
below, the derived minimax bounds differ only by a polylogarithmic factor of the dimensions which 
conhrm (near) optimality of the proposed estimation method. 

1.1 Notation 

Let us hrst set the notation used throughout the paper. Matrices and vectors are denoted by bold 
capital and small letters, respectively. The Hermitian adjoint of linear operators and matrices is 
denoted by a superscript asterisk such as Ad* and M*. The set of positive integers less than or 
equal to n is denoted by [n]. For any a x b matrix M and a set S C [b], the a x |5| restriction of 
M to the columns indexed by S is denoted by Ms- In particular, Ia,s denotes the restriction of 
the a X a identity matrix to the columns indexed by S. The all-zero and all-one matrices of size 
a X b are denoted by 0^x6 and 1^x6) respectively. The Kronecker product of matrices Mi and M 2 
is denoted by Mi (8) M 2 - The orthogonal complement of any subspace V is denoted by V'^- The 
maximum and minimum of two numbers a and b are denoted by a V 6 and a Ab, respectively. The 
notation f = 0 (g) is used when f = eg for some absolute constant c > 0. We may also use f g 
(or f ^ g) to denote f < eg (or / > eg) for some absolute constant c > 0. The largest integer that 
is less than or equal to a real number a is denoted by [aj. Similarly, [a] denotes the smallest integer 
greater than or equal to a. The function dn (•,•) denotes the natural Hamming distance between 
a pair of objects that can be represented uniquely by hnite binary sequences of the same length 
(e.g., binary vectors, binary matrices, subsets of a hnite set, etc). For any matrix M, the number 
of nonzero rows, the Frobenius norm, the nuclear norm, and the sum of the row-wise ^ 2 -iiorms (i.e., 
the £i^ 2 -iiorm) are denoted by ||AT||g 2 , ||iVf||j 7 ., ||AT||^, and ||iW||^ 2 ) respectively. 

1.2 Problem setup 

We would like to estimate a pi x p 2 matrix X* from noisy linear measurements given by Q where 
the measurement operator A : —)> M” is assumed to have the nested form 

A{X) = W{’^X), (2) 
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with the matrix ^ G ]^™-xpi and the linear operator W : known. We also assume 

throughout the paper that z in Q is a Gaussian noise with N (Onxi,distribution. The esti¬ 
mation problem is assumed to be in the high-dimensional regime (i.e., n <C PiP 2 )- We consider the 
target matrices to have rank at most r <C A p 2 and at most A: <C pi nonzero rows, where clearly 
r < k. The set of all such matrices can be explicitly dehned as 

Xk,r ■■= [x E I ||X||o 2 < k, rank(X) < r| , 

where ||X||g 2 denotes the number of nonzero rows of X. We denote the set of matrices with rank at 
most r by X,^r and the set of matrices with at most k nonzero rows by .. Furthermore, we assume 
that the rank parameter r = rank {X*) and the noise variance ci^ are known to the estimator. 

The accuracy guarantees of our method can be treated as a minimax upper bound, if they hold 
uniformly for all target matrices in To derive the desired uniform accuracy guarantees we 

assume that ^ is a restricted isometry for sparse matrices and W is a restricted isometry for 
low-rank matrices which are common assumptions for compressive sensing and low-rank matrix 
estimation laiMiiis]. To be precise, for any integer k E [pi], ^ is a said to be a restricted isometry 
over if there exists a restricted isometry constant 6k,, {^) E [0,1] such that 

VX E Xfc,. : (1 - 4 ,. W) ||X||^ < ||«Z^X||^ < (1 + 6k,. i^)) \\X\\l . 

Clearly, any restricted isometry for A-sparse vectors is a restricted isometry over Xk,, and vice versa. 
Similarly, for all integers r E [mAp 2 ]) VV is said to be the restricted isometry over X,,r if there 
exists a restricted isometry constant 6,,r (W) E [0,1] such that 

VX EX.,, : (l-4,r(W))||X||^ < ||W(X)||2 < (l + 4,r(W))||X||^. 


The goal of our minimax approach is to characterize the tolerance level r in terms of the parameters 
of the estimation problem, such that the “minimax failure probability” given by 


mf sup P 

X x*eXk,r 


X - X" 



is sufficiently small. In words, r would be the worst-case accuracy of the best estimator that 
holds with high probability. As will be seen in Section we use the restricted isometry properties 
mentioned above to provide a uniform accuracy guarantee for our proposed method which also 
serves as a minimax upper bound. To establish the minimax lower bound, we employ some of the 
standard information theoretic tools that help to determine non-trivial values of r for which the 
above minimax failure probability is signihcant, e.g., more than Derivation of the minimax lower 
bound does not depend on the restricted isometry assumptions and we merely assume that 

VXEXfc,, : P(X)||2 <7;,_,||X||^, (3) 

for some constant 7 ^,, > 0 . Of course, if A is dehned by Q and we happen to have the restricted 
isometries ^ and W as above, clearly there exists a constant 7 ^,, such that 7 ^,, < (1 -|- 6,,r (W)) (1 -|- 4,. (^))- 


1.3 Contributions 

• Minimax upper bound: 

Under the observation model described by Q and Q we propose a method that 
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With n = 0 [r {my P 2 )) and under the assumption that W and ^ are restricted 
isometries over rank-Ar matrices (i.e., and row-wise 2k-sparse matrices (i.e., 

respectively, produces some estimate X of X* such that 


X -X^ 


< 


a 


\/rJrriVp2), 


holds uniformly for all X* G ^k,r- 


In particular, we show that if ^ and W are Gaussian with properly scaled iid entries then 

With m = 0 {k log and n = 0 (r (m V P 2 )) the desired restricted isometry prop¬ 
erties hold and the produced estimate obeys 


X - X* 


< 


ax r 


{k log ^ V P 2 ), 


for all X* G ^k,r 


Because the above accuracy results hold uniformly for all targets in they can also be 
viewed as a minimax upper bound. 


• Minimax lower bound: 

We also establish a minimax lower bound merely under the assumption that the linear mea¬ 
surement operator A satishes ^ whether it takes the nested form (|^ or not. Namely, we 
show that 


If A obeys ^ then for a sufficiently small absolute constant c > 0 and for any 
estimator X there exists a target X* G ^k,r such that 


X - X^ 


> ca\ 


'klog^ -\-r{ky P 2 ) 


holds with probability at least h. 


With ’jk,r = 0 (1), the obtained minimax lower and upper bounds differ only by some logarithmic 
factors, which shows that by exploiting the nested form of the measurements the proposed esti¬ 
mator achieves a near optimal minimax rate. For comparison, if we only considered the low-rank 
structure, then for any estimator the estimation error in Frobenius norm would have been bounded 
by 0 (fx^^r {pi V ^ 2 )^ from below (cf. 1371 and |13l Theorem 2.5]). Similarly, considering only the 
sparsity as the structure of X*, the accuracy lower bounds for sparse estimation (e.g..|35[ Theorem 
1]) suggest the lower bound 0 {a^p 2 k log for the estimation error. Therefore, with the natural 
assumption that pi ^ P 2 , these lower bounds indicate that exploiting the low-rank and sparse struc¬ 
tures simultaneously has reduced the estimation error dramatically. Furthermore, in the noise-free 
scenario, our results suggest that we can recover X* exactly from 0 (r (/slog ^ VP 2 )) measure¬ 
ments. Therefore, the nested structure of the measurements and the proposed estimation procedure 
have a critical role in achieving the mentioned sample complexity which cannot be achieved by 
minimization of any convex combination of the nuclear norm and the £ 1 , 2 -norm for many common 
types of linear measurements |34| . 
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1.4 Related work 


Many important problems in statistics and machine learning such as Sparse Principal Component 
Analysis (SPCA) can be formulated as estimation of simultaneously low-rank and sparse matrices. 
These estimation problems are significantly more challenging than estimation of matrices that are 
either low-rank or sparse. In sparse spiked eovariance estimation j22| and more generally SPCA, 
given a relatively small number of independent samples of a mean-zero multivariate Gaussian, the 
primary goal is to estimate the principal eigenvectors of the associated covariance matrix that are 
assumed to be sparse. Many convex and non-convex algorithms have been proposed for these 
problems including but not limited to [I71|22l[Il[IIl 11311111 HU]. Each of these algorithms provides a 
different trade-off between statistical optimality and computational complexity. However, as shown 
in |7|, there may be an inevitable gap between the optimal statistical accuracy and the statistical 
accuracy that polynomial-time algorithms can achieve in detection of sparse principal components. 
Since SPCA can be viewed as a primitive for solving linear regression with simultaneously low-rank 
and sparse target matrices, the result of |7| can also indicate the difficulty of the latter problem. 

The problem of denoising low-rank and row-wise sparse matrices is also considered in [9]. The 
iterative algorithm proposed in [9] is basically an adaptation of the power iteration where the factors 
are sparsified through thresholding. It is shown, that under mild conditions, this iterative method 
can be initialized appropriately and then is shown to be nearly minimax optimal. A more related 
work to our problem is multiple linear regression considered in |28] where the goal is to estimate X* 
that is low-rank and row-wise sparse from the measurements Y = AX* -|- Z with A and Z being 
the design and the noise matrices, respectively. The algorithm of |28| uses the low-rank structure of 
X* to construct an initial estimate V of the right singular vectors of X*. Using this estimate, the 
algorithm then solves a least squares with a convex regularization for row-wise sparsity to estimate 
B, the projection of X* onto the estimated singular vectors. Then V is updated to be the right 
singular vectors of Y after projecting its range onto the range of B. Finally, the algorithm updates 
B by repeating the mentioned convex optimization and outputs BV^ as the estimate of X* which 
is shown to be nearly minimax optimal. 

Furthermore, an efficient alternating minimization method for estimation of rank-one and sparse 
matrices from linear measurements is proposed in |2^. The algorithm is shown to be accurate and 
nearly optimal in terms of sample complexity, provided that the factors of the target rank-one matrix 
have relatively dominant spikes and the noise level is moderate. Estimation of simultaneously struc¬ 
tured matrices from compressive linear measurements is also studied in |34| . It is shown in |34| that 
using convex proxies for each of the assumed structures independently would result in a suboptimal 
sample complexity. Specifically, it is shown jUJ that minimization of any mixture of the £i^ 2 -iiorm 
and the nuclear norm fails to recover the true signal if there are less than 0 {kp 2 A r (pi -|- p 2 )) mea¬ 
surements. Therefore, our results as stated above show a significant improvement over the naive 
convex relaxation method studied in |34) . 
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Main Results 


2.1 Two-stage estimator 


We propose the following two-stage method for estimation of simultaneously low-rank and row-wise 
sparse matrices from measurements given by Q when A has the nested structure described by ([^ . 
The nuclear norm and the sum of row-wise £ 2 -iiorm are denoted by H-H^ and IHI]^ 2 ’ respectively. 


1. Low-rank estimation stage: 


B G argmin ||.B| 
B 


(4) 


subject to IIW (B) — yW^ < n -|- cir (m V P 2 ) 


2. Sparse estimation stage: 


X G argmin ||X||^ 2 

X 


(5) 


subject to 


- B 


< C2(T-v/r7mVp2y 


With appropriately chosen constants ci and C 2 each stage is a convex program for which many 
efficient solvers exist. 


Variations of the estimator: The specific form of the optimization that expresses the low-rank 
estimation and the sparse estimation stages is not critically important. It is only necessary to 
produce a solution that is sufficiently accurate in each stage. For example, with the appropriate 
regularization parameter A > 0, we could have used the regularized analog of (Q given by 

B G argmin ^ ||>V (B) - y\\l -h A ||B||^ 

B 2, 

which also enjoys the desired accuracy guarantees |13) . Similarly, the sparse estimation stage can 
be performed by the regularized form of (|^. Furthermore, because we are assuming that ^ and 
W are restricted isometries, the convex programs considered in (Q and (|^ could be replaced by 
non-convex greedy algorithms such as the iterative hard thresholding (IHT) |8] that achieve the same 
accuracy usually at a lower computational cost. These methods, however, usually require tighter 
bounds on the restricted isometry constants. 


Post-processing: The solutions obtained by Q and ([^ generally are not low-rank or sparse. 
To enforce the desired structures, after each stage of the estimator we can simply project the 
estimator onto the set of low-rank and/or row-wise sparse matrices. It is straightforward to show 
that these post-processing steps will not change the derived error bounds beyond a constant factor. 
Furthermore, we can treat range of the best rank-r approximation to B as an estimate for the range 
of ^X* and pass it to a sparse estimation stage similar to Q, but with an optimization variable 
that has r columns rather than p 2 - Therefore, we can significantly reduce the computational cost of 
the second stage of the estimator. However, to analyze the performance of this modified estimator 
it is necessary to convert the range estimation to the actual estimation error which we do not pursue 
in this paper. 
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2.2 Accuracy of the estimator and the minimax upper bound 


In this section we state our result on the statistical accuracy of the two-stage estimator described 
by 0 and 0. This accuracy guarantee can be viewed as a minimax upper bound as well. 

For 2 : ~ N (Onxi) tail bounds for chi-squared random variables |25[ Lemma 1] guarantee that 
for any > 0 we have 

{n — v) < \\z \\2 < {n + u) 


with probability at least 1 — 2 exp A Therefore, assuming that n = 0 (r (m V P 2 )) and 

choosing 12 = cir (m V P 2 ) for a sufficiently large absolute constant ci > 0 we have 


(n — cir (m V P 2 )) < \\z\\\ < {n + cir (m V P 2 )), 


( 6 ) 


with high probability. Therefore, we can guarantee that the matrix B* = is in the feasible 

set of 0 with high probability. Consequently, we can invoke Lemma and guarantee that X* is 
in the feasible set of 0 for appropriate choice of the constant C 2 > 0. Of course, the constants ci 
and C 2 depend on the restricted isometry constants of W and 


Theorem 1 (minimax upper bound). Let n = 0 {r (m V ^ 2 ))- Furthermore, suppose that ^ and W 
have sufficiently small restricted isometry constants 6 ,^ 2 k {^) cind (W), respectively. Then, for 
appropriately chosen constants ci and C 2 , there exists an absolute constant C > 0 depending on ci 
and C 2 such that the estimate produced using 0 and 0 obeys 


X -X* 


<Ca^ [m V P 2 ) 


for all X* G ^k,r with high probability. 


The proof of Theoremj^is straightforward and provided in the appendix. If ^ and W are drawn from 
certain ensembles of random matrices or operators (e.g., Gaussian, Rademacher, partial Fourier, 
partial random circulant, etc.), then with 

m = 0 (fcpolylog (pi, A:)) and n = 0 (r (A: Vp 2 ) polylog (pi,p 2 ) A:, r)), 


where polylog () denotes a polylogarithmic factor of its argument, we can guarantee the desired re¬ 
stricted isometry properties with high probability. Therefore, an immediate consequence of Theorem 
[^in these scenarios is that we can guarantee estimation error of the order \/r {kV P 2 ) polylog {pi,P 2 , A:, r). 
To have a concrete example, the case of Gaussian operators is addressed by the following corollary. 


Corollary 1. Suppose that ^ has iid N (O, A) entries. Furthermore, suppose that W is a Gaussian 
operator that simply takes the inner product of its argument with n independent Gaussian matrices 
each populated with iid N (O, entries. If m = Ciklog^ and n = C 2 r{m\/ P 2 ) for sufficiently 
large absolute constants Ci and (72, Ahen there exists an absolute constant (7 > 0 such that the 
estimate X obtained using (0 and (0) satisfies 


X -X* 


< Ca 


(^A: log ^ V P 2 ), 


for all X* G with high probability. 
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Proof. With m = 0 (/clog^) it can be shown through the standard covering argument and the 
union bound that, with high probability, the considered ^ would be a restricted isometry over 
2fc-sparse vectors in [5]. Clearly, in this case ^ is also a restricted isometry over X 2 fc,. as it 
preserves the £ 2 -iiorm of the columns of any matrix in 'K 2 k,.- Similarly, W would be a restricted 
isometry over rank-r matrices in if n = 0 (r (m V P 2 )) |13[ Theorem 2.3]. Then, the desired 

accuracy bound follows immediately from Theorem □ 


Weaker assumptions and non-uniform guarantees: The assumptions that the linear operator 
W and the matrix ^ are restricted isometries are primarily used to establish the minimax bounds 
that are valid uniformly for all target matrices X* G ^k,r- However, these assumptions are not 
generally needed if the goal is merely to show accuracy of the proposed two-stage method for any 
particular instance of the problem. For example, for certain random operators W and matrices ^ 
that cannot be restricted isometries, the accuracy of Q and (Q can be shown by construction of a 
dual certificate through the golfing scheme |20) . In these regimes, however, the robustness to noise 
is often weaker. 


Low-rank and column-sparse matrices: If the target low-rank matrices are column-sparse 
rather than row-wise sparse our results still apply with minor adjustments, if the columns of a low- 
rank matrix are /c-sparse, but not necessarily supported on the same rows, the restricted isometry 
of W still holds. Therefore, we can follow a similar algorithm in which the fi ^2 norm in the second 
stage is replaced by another norm such as the maximum column-wise norm. The lower bound 
still holds as well since row-wise sparse matrices are a special instances of column-sparse matrices. 


2.3 The minimax lower bound 


The following theorem provides a minimax lower bound for the probability of estimation failure 
over 


Theorem 2 (minimax lower bound). Suppose that A obeys |^. Let X denote any estimator of 
X* based on the measurements of the form 0- Then, there exists a sufficiently small absolute 
constant c > 0 such that with a probability more than one half the estimation error over lik,r exceeds 


ca 


k log ^+r-(A:Vp2) 


'lk,r 


Namely, we have the minimax lower bound 


inf sup P 

X x*eXfc,,. 




> ca\ 


'klog^ +r{kVp2) 


'yk,r 


> 


1 

2' 


In words. Theorem shows that if 'jk^r = 0(1)) then the error of any estimator of the matrices 
in 'Kk,r cannot be uniformly better than 0 k log ^ + r (kV P 2 )) with high probability. For 
instance, this result implies that the bound established in Corollary is near-optimal as the error 
bound is within log (^) factor of the lower bound. We would like to emphasize that, we did not 
assume the nested structure ([^ for A to prove the lower bound. Therefore, the lower bound applies 
to any linear operator A obeying ([^ regardless of whether it is of the form Q or not. 











3 Applications and Extensions 


3.1 Blind deconvolution with randomly coded masks 

The linear measurements of the form (1^ appear in some interesting and important applications in 
computational imaging. For example, can describe the measurements in the lifted formulation 
of blind deconvolution problems that arise in imaging systems with randomly coded masks Halls]. 
In these problems rank-one target matrix is X = uh* with u and h being the target image and 
the Discrete Fourier Transform (DFT) of the unknown blurring kernel, respectively. The blind 
deconvolution can be solved effectively, by estimating the matrix X from the linear measurements 
of the form 


^(X) =^*(XoF). 

where F denotes the DFT matrix, o denotes the Hadamard product, and ^ is a matrix whose 
columns model the random masks that modulate the image u. We assume that the image u is 
sparse with respect to an orthonormal basis that is incoherent with the canonical basis. This 
assumption is realistic because, if necessary, spectral modulation can always create an effective 
sparsifying basis for the target image that has the desired incoherence condition. To simplify the 
exposition, let us assume that the image is sparse in the DFT basis and we have u = Fu with u 
being a fe-sparse vector. Therefore, instead of X one may consider X = uh* as the target rank-one 
matrix which is also row-wise sparse. Now if the random masks F are designed to suppress all but 
a small number of randomly chosen pixels indexed by I?, then the measurement model reduces to 

A (x) = F*^ (FoX o Fn) , 

where the subscript I? denotes restriction to the rows indexed by 17. It is now clear that with 
F = Fq and W : B vec Fq)), the above equation is a special case of ([^ up to a trivial 

vectorization. 


3.2 Low-rank and doubly-sparse matrices 


In this paper, we chose to state the main theoretical results only for the low-rank and row-wise 
sparse model mentioned in Section 1.2 However, these results can also be easily extended to the 
case of low-rank and doubly sparse matrices, where the target matrix is sparse both row-wise and 
column-wise. These kinds of matrices occur, for instance, in compressive phase retrieval, elaborated 
on in the following subsection, and covariance sketching H). The appropriate nested form of the 
measurement operator for these problems is 


>I(X) = W{^iX^l), 


(7) 


with W, Fi, and F 2 given. The only modification needed for the estimator would be to use 
and perform the first sparse estimation stage for row-wise sparsity as before, and then use F 2 to 
perform a second sparse estimation stage for column-wise sparsity. 
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3.3 Compressive phase retrieval 


An interesting special case of the extension described above by ([^ is when 'I'l = '^2 and the lin¬ 
ear operator W measures inner product of its argument and some rank-one matrices WiW* (i.e., 
yV : B [{wiW^, B)]^^^). In particular, this model can be used in Compressive Phase Re¬ 
trieval (CPR) |31| [ 39 ] where the target matrix X* = xx* with a sparse a; G is estimated 

from measurements |(aj,a;)| . Therefore, the measurement operator in CPR can be written 

L J 2=1 

as ^ ; X I— 7- [{uia*, X)Yl^^. The CPR problem is posed as non-convex optimization problems in 
[Ml Ea [38] where the sparsity of the solution is minimized subject to a constraint on the quar- 
tic prediction error or vice versa. While certain local convergence guarantees are established for 
the GESPAR algorithm |38] . global convergence and statistical accuracy of the proposed algorithm 
remains unknown. In [32], another non-convex approach based on alternating minimization is pro¬ 
posed for the standard Phase Retrieval (PR) as well as the CPR. With the particular initialization 
proposed in |32| , the alternating minimization method is shown to converge linearly in the noise-free 
PR and CPR problems. However, this convergence rate holds for CPR when the number of mea¬ 
surements grows quadratically in k. For large-scale problems this alternating minimization method 
would still be a favorite choice among the competing algorithms as it is computationally less de¬ 
manding. Furthermore, assuming iid Gaussian measurement vectors a*, [33] and m consider a 
convex relaxation to the lifted CPR problem formulated as 


trace (X) -|- A ||X||^ 
subject to A (X) = y, 


argmm 


for some parameter A > 0. For the considered type of measurements, m shows that the above 
convex program can recover the target sparse and rank-one matrix X* if the number of measure¬ 
ments depend quadratically on k, the number of the nonzeros of the signal x. The same result is 
shown in m for sub-Gaussian measurements and through a simpler derivation by “debiasing” the 
measurements and showing the obtained operator obeys an ii/i 2 variant of the restricted isometry 
property. Furthermore, m demonstrates that the quadratic dependence on k cannot be improved 
fundamentally by varying the coefficient A which is in agreement with the results of |34) . 

As mentioned above, if we choose a* = ^*Wi for some compressive sensing matrix and random 
vectors then the measurement operator takes the nested form A : X [{•WiW*,^X^*)]^^^. 

It is then straightforward to apply the framework developed in this paper and show that X* can 
be reconstructed accurately and efficiently from n <C p measurements obtained by A through the 
two-stage recovery 


B G argmin trace (S) 


( 8 ) 


subject to {w*Bwi — yi)^ < 


2 = 1 


X G argmin ||X|| 
X 


(9) 


subject to 


^X^* - B 


< — 
F ~ Jn' 


where e > II 2 II 2 is a bound on the noise, H-H]^ denotes the ii norm, and C is an absolute constant. The 
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precise guarantees for effectiveness of nested measurements in CPR are established independently 
in |21) . |46) . and [2]. In particular, the following is shown in |2]. 

Theorem 3 (|2l Corollary 1]). Let^ G be a matrix with independent N(0, entries, and for 

i = 1,2,... ,n let Wi € M'” be independent copies of a vector with i.i.d. standard Gaussian entries. 
For an arbitrary k-sparse signal x ^ we observe measurements of the form yi = x)‘^ + Zi 

. If m > Cl/clog I and n > C 2 m for sufficiently large absolute constant ci and C 2 , then the two-stage 
recovery through ([^ and (§ produces an estimate X that obeys 


X — xx^ 



with high probability for some absolute constant C > 0. 


The fact that the intermediate operator W : B [{wiW*,B)f^^.^ is generally not a restricted 
isometry precludes our minimax analysis. However, as mentioned in Section |1.3| the restricted 
isometry conditions are used merely to obtain uniform accuracy guarantees. The desired uniform 
accuracy guarantees can be established through other approaches, as done in [25 using the small-ball 
method EQ]. Moreover, if we disregard uniform guarantees and thus minimax optimality, instance 
accuracy guarantees of the low-rank estimation stage also can be established through the small-ball 
method m or the construction of a dual certificate m- Therefore, we can show that the CPR 
problem can be solved in our proposed framework accurately with a sample complexity that grows 
much slower than the ambient dimension of the target signal. 


4 Numerical Experiment 


We performed a set of simulations on synthetic data in which both ^ and W are considered to be 
Gaussian as in Corollarywith m = \5k log and n = 4r (m V p 2 ). We generated target matrices 
of size 1000 X 30 (i.e., pi = 1000 and p 2 = 30) that are factored as X* = UV'^ where U G 
whose k nonzero rows are chosen uniformly at random and V G The nonzero entries of both 

U and V are independent draws from the standard Gaussian distribution. The noise variance is 
fixed at cj^ = 10“^. The rank of the target matrix is selected from the range 1 < r < 10. Similarly, 
the row-wise sparsity is selected from the range 10 < A: < 19. For each pair of (r, k) the algorithm 
is tested for 100 trials. We used the TFOCS package |6] for the low-rank estimation stage Q. 
We also used a variant of the Alternating Direction Method of Multipliers (ADMM) adapted from 
for the sparse estimation stage ([^. Figure illustrates the variation of the empirical median 

X — X* (i-e., the normalized squared reconstruction error) versus k and r. The spread 


of 


1 


around each data point indicates the distribution of the normalized squared reconstruction error. 

2 


As can be seen from the figure. 


X -X^ 


of them is fixed which is in agreement with t 


grows almost linearly with respect to r and k if one 
re theoretical results. 


We also ran a simulation for the CPR described in Section 3.3 on a two-dimensional image of 
SaturnQ The target image has size 1, 280 x 1,024, thus the ambient dimension is p = 1, 280T, 024 = 
1,310,720. The target is compressible in the wavelet domain. In particular, with eight levels of 


^Adapted from NASA’s Voyager 2 image, 1981-08-24, NASA catalog 7^PIA01364 
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||2 


(a) ^ ||x — -?^*|| vs. r for k £ {10,14,18} 



(b) ^ llx — JC*|| vs. k for r £ {2, 6,10} 
II IIf 


Figure 1: Normalized squared reconstruction error for different values of rank (i.e., r) and row-wise 
sparsity (i.e., k) 
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(a) original (b) non-sparse phase retrieval (c) sparse phase retrieval 

Figure 2: Compressive phase retrieval of Saturn from nested measurements. While ordinary phase 
retrieval yields a poor estimate, the proposed two-stage approach that exploits sparsity has produced 
an estimate with relative error less than 5%. 


two-dimensional 30-tap Coiflet wavelets |18) we considered the sparsity of about k = 3, 000. The 
goal is to estimate the sparse/compressible wavelet coefficients of the target from the phaseless 
measurements of the form y = where the modulus is taken entrywise. Here x represents 

the vectorized target image, ^ effectively performs a Rademacher modulation followed by m = 
[2A:(1 -|- log |)] = 42,479 random Fourier measurement, and W makes Fourier measurements at m 
randomly chosen frequencies for each of C = 20 masks with octanary patterns described in |12) . 
Therefore, the matrix ^ has m = 42,479 rows, and the total number of measurements (i.e., the 
number of rows in W) is n = Cm = 849, 580. We compared the performance of our two-stage 
approach with the estimate produced through ordinary phase retrieval. In both cases, we used 
500 iterations of the Wirtinger-Flow algorithm |12) as the phase retrieval solver. For the sparse 
recovery stage of our proposed method we relied on 100 iterations of the IHT algorithm |8]. Figure 
l^shows, respectively from left to right, the original image, the recovered image using ordinary phase 
retrieval, and the recovered image through the proposed two-stage CPR. As counting the degrees of 
freedom would suggest, the ordinary phase retrieval is expected to fail given that we only obtained 
n ~ 0.65p measurement. The two-stage recovery, however, produced an estimate with relative error 
of less than 5%. The specified algorithms and measurement schemes allowed us run the simulation 
without significant memory or computational requirements. Without computational and memory 
restrictions, the two-stage method can achieve a similar accuracy at even lower sampling rates (i.e., 
n/p) by using generic (e.g., Gaussian) measurements. 
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A Auxiliary tools 

To facilitate readability of the proofs, in this section we state the most important anxiliary resnlts 
in onr derivations that we borrowed from the literatnre. 

Theorem 4 (Fano-type ineqnality |42| Theorem 2.5]). Let d (•, •) be a metric and D (P || P^) denote 
the Kullback-Leibler divergence of probability measures P and PL Assume that M >2 and suppose 
that 0 contains elements 9 q,9i, ..., 9m associated with probability measures Pj := P^^ such that: 

1. d{9j,9k) >2s>0, yO<j<k<M, 

2. Pj is absolutely continuous with respect to Pq for all j = 1,2,..., M, and 



with 0 < a < |. 


Then 



Lemma 1 (Varshamov-Gilbert bonnd variation |29[ Lemma 4.10]). Let {0,1}'^ 6e equipped with 
Hamming distance din o-nd given 1 < D < define {0,1}^ = |a; G {0,1}'^ | d\-\ (0, x) = Zl|. Eor 

every a G (0,1) and fi G (0,1) such that D < afiN, there exists some subset 0 of {0,1}^ with the 
following properties 


dH (d, 9') >2{l-a)D^ {9,9') G 0^ with 9 9', 



where 
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B Proofs 


B.l Proof of the minimax upper bound 

The following lemma shows the accuracy of the first stage of the estimation procedure. The proof 
is analogous to the standard analysis of compressive sensing and low-rank matrix estimation based 
on the restricted isometry property. In particular, our derivations are similar to that of |13| . We 
also refer to some of the lemmas proved in |13) without repeating their statements explicitly. 

Lemma 2 (Low-rank reconstruction). Suppose that n = 0 (r (m V P 2 )) CLnd the constant ci in 0 
is sufficiently large to guarantee that B* is feasible. If the restricted isometry constant (kV) < 
^ then for some constant C 2 > 0 depending only on the restricted isometry constants, the pre¬ 


estimate B obtained from obeys 


B-B* 


< c^o^rijn V pf), 


with high probability. 


Proof. It follows from optimality of B and feasibility of B* in that 


B 


< IIS* 


( 10 ) 


Let UBV'^ be the (compact) singular value decomposition of the rank-r matrix B* G 
Denote the subspace of matrices that are “supported” on U and V by 

To = |s G I (^I - b(^I- = o| . 

Furthermore, let S = S — B* and define Eq to be the projection of E onto the subspace Tq. We 
have 


|s*lL > 


B 


= \\B* + E — Eq + Eq\ 


> \\B* + E - Soll^ - ||So||^ 

= l|S*|L + ||S — SnlL — ||Sn 


where the first and the second line follow from (10) and the triangle inequality, and the third line 


follows from the fact that B* and E — Eq have mutually orthogonal columnspaces and rowspaces 
since E — Eq G (see |36l Lemma 2.3]). Therefore, we deduce that 


|S - SnIL < IIS 


OIL ^ ll-C/OlU ^ 


< ||Si 


oIIf ' 


( 11 ) 


Using the definition of the measurement y = W (B*) -|- 2 : we can write 


W[B]-y 


= ||W (E) - ^11^ = ||W (S)ll^ - 2 {W (E ), z) + ||z||^ 

>||W(S)||2-2||S|U|W* (;s)|| + ||z||^ 
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It is shown in m Lemma 1.1] that for a sufficiently large constant c > 0 we have HW* {z)\\ < 
cay/m V p 2 with probability exceeding 1 — where cq is an absolute constant. Therefore, given 

the feasibility of B and ^ we obtain 

IIW {E)\\l < 2ccr ||£;||^ a/tuV^ + 2cicr^r (m V P 2 ) ■ 


Then, using the triangle inequality ||£^||^ < ||T^o||* + 11-^ “ -^olL and (11) we deduce that 

||W(T;)||2 < AcaV^WEo 11^ y/m\/ p 2 + 2ci(T^r (m V P 2 ) 

< A:ca^/2r\\E\\p^/mV^ + 2cia‘^r {m\/P 2 ) ■ ( 12 ) 

For z = 1, 2,..., dehne the matrices Ei recursively as the best rank-2r approximation of —X]}=o 
until Ei = 0. Clearly, we have 

E = '£Ej, 

j>0 

and thereby 

W{E) = ^W{E,). 
j>o 

To simplify the notation we use the shorthand 6 ,^ 4 r = (W) below. It follows from |13[ Lemma 

3.3] that for any pair of distinct indices j and f we have 

{W {Ej) ,W {E,,)) > -6.,^r\\Ej\\p\\Ej,\\j,. 

Therefore, we can expand \\W (T ^)||2 and write 

2 

Ww{E)\\l= 

i>o 2 


= \\W[Eo) + W{E^)f 2 + 


E 


i>2 


+ 2 (W {Ei,) + W {E ,), W {Ej)) 
2 ^'^2 


\\W{Eo) + W{E,)\\l + Y,\\y^{E, 


3)\\2 


i>2 


+ 2 ^ {W{Ey),W{Ej)) + 2 Y,{^{Eo)+y^{Ei),W{Ej)) 

/>i>2 i>2 


— (1 “ j ll^'O + -E'lllir + 'y ^ \\E^ 


3.1 


i>2 


‘^^•Ar'^^i\\Eo\\p + \\Ei\\p) \\Ej\\p — 26,^4r ll-E', 

i>2 i'>j>2 


3 \\F ll-^i'IlF 


— II-^'IIf ~ '^•,4r ( \\Eo + + 2 (||TJo|li? + I|-^i|If) ll-^illF 

i>2 

<5.,4r j \\Ej\\P 

J >2 


( 13 ) 
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Because Eq and Ei are orthogonal we have ||£^o|li;’ + ll-£'i||_p ^ ll-E'O + ^iWf- Furthermore, the 
construction of Ej guarantees that 

\\Ej\\F<^\\Ej-i\l, 

for j > 2. Since for J > 1 both the columnspaces and the rowspaces of the matrices Ej are mutually 
orthogonal, we can invoke |36[ Lemma 2.3] and write 


J>2 

Therefore, in view of we obtain 


J>2 


— \\E — EnW^. 


j>2 j>2 


Eii^j 


i-i| 




— Eo\\. < llT^ollF < \\Eq + E 


i|If • 


Therefore, we can simplify the bound (13) to 


liw (t;)||^ > ||t;||^ -2(1 + V2J < 5 ., 4. ||t;o + 
> (1 - 2 (1 + \/2) 5 ., 4 ,) 


If d,^ 4 r < ^ then (12) and (14) yield 


\\E\\p < C 2 (y-\/ r{m\l P 2 ), 

for some constant C 2 > 0 that only depends on the restricted isometry constants of W. 


(14) 


□ 


We are now ready to prove Theorem 

Proof of Theorem\^ Recall that B* = ’EX*. Lemma [^guarantees that B obtained from (j^ obeys 


EX* - B 


B* - B 


< C2a^yr{mV P 2 ), 


where C 2 is constant depending only on the restricted isometry constants of W. Therefore, X* 
is feasible for (|^. The fact that E is also a restricted isometry allows us to apply results from 
compressive sensing of block-sparse signals (see, e.g., m Theorem 2 and Section VI]) and obtain 


X - X" 


<Ca^ (m V P 2 ), 


where C > 0 depends on 5,^4r (W) through C 2 and on d 2 k,, 


□ 
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B.2 Proof of the minimax lower bound 


We follow a standard strategy to establish the minimax lower bound using information theoretic 
tools. The minimax probability of inaccurate estimation over ^k,r can be bounded from below by 
the same quantity over any subset of ^k,r, particularly the hnite subsets. This probability can be 
further relaxed to the minimax probability of error in a hypothesis testing problem dehned for the 
same subset. Therefore, to obtain a tight lower bound on the minimax probability of error, it suffices 
to construct a large set of hypotheses that are difficult to distinguish based on the observations. In 
particular, we need to choose a hnite but sufficiently large number of potential target matrices in ^k,r 
such that they are well-separated but they are difficult to distinguish from their noisy compressive 
measurements. To this end, we construct two different hnite subsets H' and H" of the set of pi x p 2 
matrices of rank no more than r that have at most k nonzero rows and Frobenius norm at most e 
which is denoted by 


^k,r,e ■— £ ^k,r \ II^IIf — ' 

Construction of these two subsets depend critically on the choice of the sets S, T', and T" that 
are subsets of {S C [pi] | |S| = k}, {±1}^^^^, and {±1}^^'", respectively. Elements of the set S 
determine the indices of the nonzero rows, whereas the elements of T' and T'' determine the value of 
the nonzero entries. As explained below, we rely on Lemma a variant of the Varshamov-Gilbert 
bound established in |29) . to appropriately choose S, T', and T". Finally, using the hypothesis sets 
X' and X" we derive a minimax lower bound by invoking Theorem which is one of the Fano-type 
inequalities established in |42) . 


Proof of Theorem^^ Given S, T', and T" dehne 


X' 


e 


Ipi,S 


^1 i-fei I ^ T £ T' 


and 


® r?! ) I S S and T e T"} , 

In words, each matrix in X' basically consists of ^ copies of an r x p 2 binary matrix that are stacked 
on top of each other and interleaved with all-zero rows. Clearly, the matrices in H' have k nonzero 
rows, are of rank at most r, and have Frobenius norm of e, showing that X' C ^k,r,e- Similarly, each 
matrix in X'' is essentially a horizontal concatenation of ^ copies of an /c x r binary matrix that is 
interleaved row-wise with all-zero rows. It is straightforward to verify that the matrices in X" have 
k nonzero rows, are of rank at most r, and have Frobenius norm of e, which show that H" C ^k,r,e 
as well. 

We use the Varshamov-Gilbert bound as stated in Lemma to choose sufficiently large sets S, T', 
and T". Treating each of the sets in S as a binary sequence of length pi, Lemma guarantees 
existence of a set § of A;-subsets of [pi] such that 

log|S| > d,,iSi,S2)>]k,yiSi,S2)eS^Si^S2. 

zb k 4 
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We can also treat each matrix in T' as a binary string of length rp 2 , apply Lemma and show that 
there exists a set T' of matrices in that satishes 


log |T'| > ^rp 2 , and dn (Ti, T 2 ) > ^rp 2 , V (Ti, T 2 ) G T'^, Ti / T 2 . 

Zo o 

Similarly, there exists a set T" of matrices in {±1}^^’’ such that 

log|T"| > ^rk, and dy, (Ti,T 2 ) > Kk, V(Ti,T 2 ) G T"^ Ty / T 2 . 

Zo o 


Let (5i,Ti) and (S' 2 ,T 2 ) be two distinct pairs in S x T' using which we can construct the matrices 
Xy and X 2 in X', respectively. If 7 ^ S 2 , then counting only the rows of Xy — X 2 that are not 
in the set 5i n S 2 we obtain 

||Xi - X2||^ > ^ — y/d\-\ (5i, S2)p2 > X- 

\/kp2 2, 

Furthermore, if Sy = S 2 and Ty 7 ^ T 2 , then we have 


Xl-X2||^ > 


2e 

\fl^ 


dH (Ti,T2) 




Therefore, the above inequalities and the fact that log |X'| = log |S| + log |T'| show that the set X' 
obeys 


4 . , Pi 3 


log|X'| > ^klog^ + ^rp 2 , and 
Similarly, we can show that X" obeys 


\Xy-X 2 \\F>^, V(Xi,X 2 ) GX'^ Xi 7 ^X 2 (15) 


log|X"| > A/clog^ + i/c, 


and 


|Xi-X2||^> -, V(Xi,X2) G 




Xy ^ X 2 (16) 


For any matrix X G denote the Gaussian distribution N (^(X) ,cr‘^l) which is the 

distribution of the measurement y if X is the target matrix. Recall that by assumption ^ is a 
restricted isometry over ^k,r as dehned by ^ with the restricted isometry constant S = 6k,r (-4,). 
For any X G X' C ^k,r the KL-divergence between Px and Pq can be bounded as 


DiP: 


'0 = 


\\A{X)\\l ^^ Jk,r\\XfF 


2(72 - 20-2 20-2 ’ 

where the inequality follows from restricted isometry assumption on A. Therefore, we have 


1 


D{Px\\Po)< 

xex' 




Suppose that we have 


7fc,re^ 

2ct2 


< a'log|X'| , 
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holds for some ol G (O, Then, because any matrix X G X' also satishes ||X — 0||j;. = ||-Xi||^ 
e > we can use (|15[) and invoke Theoremto guarantee that 


mf sup P 

X X*£Xk^r,e 


X -X* 



“ l + \^ 



From (15) we have the crude lower bound log|X'| > Therefore, with a' = 5 x 10 ^ we can 
choose 


e = 10 


' A: log ^ + rp 2 


'lk,r 


and obtain 


inf sup P 

X X* 


X -X^ 


o k log ^ + rp2 \ 1 

> 2.5 X 10~VW ^ ' 


> 


'Ik,', 


Furthermore, for the set X" if we have 


(1 + h) 

2^2 


//I_I ^// 1 


< a log 


for some G (O, |), then using (16) we can apply Theorem and show that 

^ 


mf sup P 


X -X* 


>il> 


1 - 2a" - 


2 a" 


F 4/ l + y|^\^^ “ 

Similar to the argument for X', with a" = 5 x 10“^ we can choose 


V 7 fc,r 


and obtain 


mf sup P 

X X* 


X -X* > 2.5 X lO-V W"^ ^ - I > -. 


I k log ^ + rk\ 1 

2' 


Ik,: 


Then with e = 2.5 x 10 '^a 


we can deduce from (17) and (18) that 


(17) 


(18) 


mf sup P 


X -X^ 


o I klog ^ + r (k \/ p2)\ 

> 2.5 X 10 "VW-^^- — 

\ Ik.v 


1 

> nl 


which also guarantees the desired minimax lower bound on ^k,r- 


□ 
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